Topics

  • Basic ideas of Bayesian Model Averaging

  • Application of Bayesian Model Averaging using R package BAS

  • Estimation, Interpretation and Prediction

  • Advantages of Bayesian Model Averaging over all-or-none model selection

Libraries

The R package BAS provides ways of carrying out BMA for linear regression and generalized linear models. It contains functions for plotting the BMA posterior distributions of the model parameters, as well as an image plot function that provides a way of visualizing the BMA output. The function bas.lm provides a fast, automatic default way of doing this for linear models.

library(BAS)
library(tidyverse)
library(broom.mixed)

To illustrate how BMA takes account of model uncertainty about the variables to be included in linear regression, we will be using the nhanes dataset, whose variables are described in the file nhanes-codebook.txt. Load this data with the load function, specifying the data file:

load(file='nhanes1518.rda')

Exploratory Data Analysis

We first explore the URX predictors (i.e. the ones related to phthalate concentrations), subsetting the dataset to include only the predictors in the weighted quantile sum module, and then filtering out NA values:

nhanes_URX <- nhanes1518 %>%
  select(BMXBMI, URXUCR, URXCNP, URXCOP, URXECP, URXHIBP,
         URXMBP, URXMC1, URXMEP, URXMHBP, URXMHH) %>%
  na.omit()

We start exploring the data by plotting a correlation matrix of the variables of interest. We can use the corrplot() function from the corrplot package and adjust the style to suit our desired aesthetics. We see that the predictors are highly correlated, especially URXMHH with URXECP, and URXMHBP with URXMBP.

library(corrplot)
corr_mat <- cor(nhanes_URX, method = "s") # Spearman correlations
corrplot(corr_mat,
         addCoef.col = 1,
         number.cex = 0.5) # change the font size of the correlation coefficients

# other styles:
# default 
# corrplot(corr_mat) # circles
# squares, variables presented in an alphabet order
# corrplot(corr_mat, method = 'color', order = 'alphabet') # squares
# can choose different style for lower and upper triangle, can order the variable by clustering result
#corrplot.mixed(corr_mat, lower = 'shade', upper = 'pie', order = 'hclust')
# Reshape to long format for faceted histograms (tidyverse is already loaded)
df_long <- nhanes_URX %>% pivot_longer(everything())

# Plot histograms using ggplot
ggplot(df_long, aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ name, scales = "free")

We also plot the distributions of the variables involved in the model and notice that they have long tails on the original scale, which suggests the log transformation we will apply later on.

Bayesian Model Averaging

Motivated by chemical mixtures, one way to deal with highly correlated predictors is to select variables, or to select a model. Traditionally, analysis often proceeds by first selecting the best model according to some criterion and then learning about the parameters given that the selected model is the underlying truth. This approach has potential issues: (1) we cannot quantify model uncertainty through these all-or-none selection methods; (2) there are often many modeling choices that are secondary to the main questions of interest but can still have an important effect on conclusions. An alternative is to learn the parameters of all candidate models and then combine the estimates according to the posterior probabilities of the associated models. Bayesian Model Averaging (BMA) carries out this model-combination idea: a parameter estimate (or prediction) is obtained by averaging over the different models under consideration, each weighted by its posterior model probability. The formulas below make this precise.

Given a quantity of interest \(\Delta\), such as an effect size, a future observation, or the utility of a course of action, its posterior distribution given data \(D\) is

\[pr(\Delta|D) = \sum_{k=1}^K pr(\Delta|M_k, D)\, pr(M_k|D),\]

where the posterior probability of model \(M_k\) is

\[pr(M_k|D) = \frac{pr(D|M_k)pr(M_k)}{\sum_{l=1}^K pr(D|M_l)pr(M_l)}.\]

This is an average of the posterior distributions under each of the models considered, weighted by their posterior model probabilities.
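In particular, the BMA point estimate of \(\Delta\) is the posterior-probability-weighted average of the model-specific posterior means:

\[\hat{\Delta}_{BMA} = E[\Delta|D] = \sum_{k=1}^K E[\Delta|M_k, D]\, pr(M_k|D)\]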

There are three R packages available for BMA: BAS, BMS, and BMA. A thorough comparison is presented in this paper. In this module, we will carry out Bayesian Model Averaging using the BAS package.

Bayesian Model Averaging for Linear Regression Models

In our setting, each candidate model \(M_k\) in the formulas above is a linear regression model that includes a subset of the candidate predictors. Writing \(X_k\) for the design matrix of the predictors included in \(M_k\), the model is

\[Y = \alpha + X_k\beta_k + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I),\]

and \(pr(D|M_k)\) is the marginal likelihood of the data under \(M_k\), obtained by integrating over the prior on \((\alpha, \beta_k, \sigma^2)\).

We fit a BMA model on the dataset using BMI (BMXBMI) as the response variable and all 10 URX variables as predictors, after log transforming both the response and the predictors:

As seen in the EDA, we perform the log transformation because of the long tails and extreme values on the original scale.

To get started, we will use BAS with the Zellner-Siow Cauchy prior on the coefficients.

nhanes_URX <- log(nhanes_URX)
model <- bas.lm(BMXBMI ~ .,
  data = nhanes_URX,
  prior = "ZS-null",
  modelprior = uniform(), initprobs = "eplogp", method = "MCMC",
  force.heredity = FALSE, pivot = TRUE
)
summary(model)
##           P(B != 0 | Y)  model 1     model 2     model 3     model 4
## Intercept     1.0000000   1.0000   1.0000000   1.0000000   1.0000000
## URXUCR        0.9993164   1.0000   1.0000000   1.0000000   1.0000000
## URXCNP        0.9976563   1.0000   1.0000000   1.0000000   1.0000000
## URXCOP        0.6358887   1.0000   0.0000000   0.0000000   1.0000000
## URXECP        0.9985352   1.0000   1.0000000   1.0000000   1.0000000
## URXHIBP       0.9999512   1.0000   1.0000000   1.0000000   1.0000000
## URXMBP        0.9490723   1.0000   1.0000000   1.0000000   1.0000000
## URXMC1        0.0675293   0.0000   0.0000000   0.0000000   1.0000000
## URXMEP        0.9997559   1.0000   1.0000000   1.0000000   1.0000000
## URXMHBP       0.9994629   1.0000   1.0000000   1.0000000   1.0000000
## URXMHH        0.9109375   1.0000   1.0000000   0.0000000   1.0000000
## BF                   NA   1.0000   0.5747392   0.1126271   0.1144826
## PostProbs            NA   0.5219   0.2790000   0.0583000   0.0533000
## R2                   NA   0.2726   0.2716000   0.2703000   0.2729000
## dim                  NA  10.0000   9.0000000   8.0000000  11.0000000
## logmarg              NA 869.0414 868.4875542 866.8577204 866.8740607
##                model 5
## Intercept   1.00000000
## URXUCR      1.00000000
## URXCNP      1.00000000
## URXCOP      1.00000000
## URXECP      1.00000000
## URXHIBP     1.00000000
## URXMBP      0.00000000
## URXMC1      0.00000000
## URXMEP      1.00000000
## URXMHBP     1.00000000
## URXMHH      1.00000000
## BF          0.05824945
## PostProbs   0.02890000
## R2          0.27100000
## dim         9.00000000
## logmarg   866.19837253

In the call to bas.lm above, the formula and data arguments work as in lm, with the shorthand . including all remaining variables as predictors. prior = "ZS-null" places the Zellner-Siow Cauchy prior on the coefficients; modelprior = uniform() assigns equal prior probability to every candidate model; initprobs = "eplogp" sets the initial sampling probabilities of the predictors using a Bayes-factor approximation based on p-values; method = "MCMC" samples models via Markov chain Monte Carlo rather than enumerating the full model space; force.heredity = FALSE does not require lower-order terms to accompany higher-order ones; and pivot = TRUE uses a pivoted decomposition for numerical stability with rank-deficient models.

In the summary output, the column P(B != 0 | Y) gives the marginal posterior inclusion probability of each predictor, i.e. the total posterior probability of all models containing it. The remaining columns describe the five highest probability models: a 1 or 0 indicates whether a variable is included in that model, BF is the Bayes factor of each model relative to the highest probability model, PostProbs is the posterior model probability, R2 is the ordinary coefficient of determination, dim is the number of coefficients (including the intercept), and logmarg is the log marginal likelihood. Here almost all of the URX predictors have inclusion probabilities close to 1, while URXMC1 has an inclusion probability below 0.1, suggesting the phthalate concentrations other than URXMC1 are useful for predicting BMI. URXCOP is the most ambiguous case, with an inclusion probability of about 0.64: the posterior probability is spread over models with and without it, which is exactly the model uncertainty BMA is designed to retain.

For further worked examples of drawing insights from BMA output, see Section 7.2 of Hoeting et al.'s Bayesian Model Averaging: A Tutorial, as well as the applications in Bayesian Model Averaging: Theoretical Developments and Practical Applications.

In R, the image function may be used to create an image of the model space that looks like a crossword puzzle:

image(model, rotate=F)

Here, the predictors, including the intercept, are on the y-axis, while the x-axis corresponds to the different models, ordered by posterior probability. Red indicates that a variable's estimate is positive in that model, blue indicates a negative estimate, and beige indicates that the variable is not included in the model. The image plot shows that most of the URX variables are useful in predicting BMI, except for URXMC1, which is not included in any of the top models shown.

If we view the image by rows, we can see whether a variable is included in a particular model. For example, URXUCR and URXMEP appear in every model shown, while URXMHH appears in the top two models, which have the largest posterior probabilities, but not in the remaining ones.

We can also plot the posterior distributions of these coefficients to take a closer look at the distributions:

coef.model <- coef(model)
# plot(coef.model, mfrow=c(3,3), ask = F)
plot(coef.model, ask = F)

This plot agrees with the summary table we obtained above: the posterior distributions of URXMC1 and URXCOP have a large point mass at 0, while the distributions of URXUCR and URXMEP have very little mass at 0. There is a small spike at 0 for URXUCR, indicating that its posterior inclusion probability is not exactly 1; however, the mass at 0 is so small that we are almost certain URXUCR should be included under Bayesian model averaging.

Bayesian Model Diagnostics

The default plot method produces a panel of diagnostics: residuals versus fitted values, cumulative model probabilities, model complexity (dimension versus log marginal likelihood), and the marginal inclusion probabilities.

plot(model, ask = F)

Because we fit the model with method = "MCMC", we can check convergence by comparing the posterior inclusion probabilities estimated from the MCMC sampling frequencies with those computed from the renormalized marginal likelihoods; points lying close to the 45-degree line indicate the chain has been run long enough.

diagnostics(model, type = "pip", pch = 16)

We can also plot credible intervals for the coefficients:

plot(confint(coef.model), estimator = "HPM")

Bayesian Model Predictions

With the predict function, we can obtain predictions at either the observed data or at new values, together with measures of their uncertainty.

BAS has methods defined to return fitted values (fitted, which uses the observed design matrix) and predictions (predict, at either the observed data or potentially new values), as with linear regression:

muhat.BMA <- fitted(model, estimator = "BMA")
BMA <- predict(model, estimator = "BMA")

# predict has additional slots for fitted values under BMA, predictions under each model
names(BMA)
##  [1] "fit"         "Ybma"        "Ypred"       "postprobs"   "se.fit"     
##  [6] "se.pred"     "se.bma.fit"  "se.bma.pred" "df"          "best"       
## [11] "bestmodel"   "best.vars"   "estimator"

We plot the two sets of fitted values:

par(mar = c(9, 9, 3, 3))
plot(muhat.BMA, BMA$fit,
  pch = 16,
  xlab = expression(hat(mu[i])), ylab = expression(hat(Y[i]))
)
abline(0, 1)

We see that they are in perfect agreement, which is always the case: the posterior mean of the regression mean function at a point \(x\) is also the posterior expected value of \(Y\) at \(x\).

In addition to using BMA, we can use the posterior means under model selection. This corresponds to a decision rule that combines estimation and selection. BAS currently implements the following options:

Highest Probability Model

HPM <- predict(model, estimator = "HPM")

# show the indices of variables in the best model where 0 is the intercept
HPM$bestmodel
##  [1]  0  1  2  3  4  5  6  8  9 10

For a more interpretable version with variable names:

variable.names(HPM)
##  [1] "Intercept" "URXUCR"    "URXCNP"    "URXCOP"    "URXECP"    "URXHIBP"  
##  [7] "URXMBP"    "URXMEP"    "URXMHBP"   "URXMHH"

Median Probability Model

MPM <- predict(model, estimator = "MPM")
variable.names(MPM)
##  [1] "Intercept" "URXUCR"    "URXCNP"    "URXCOP"    "URXECP"    "URXHIBP"  
##  [7] "URXMBP"    "URXMEP"    "URXMHBP"   "URXMHH"

This is the model where all predictors have an inclusion probability greater than or equal to 0.5. This coincides with the HPM if the predictors are all mutually orthogonal, and in this case is the best predictive model under squared error loss.
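As a quick check, the MPM can also be recovered directly from the fitted bas object by thresholding the inclusion probabilities at 0.5 (a sketch using the model object from above, where probne0 stores the posterior inclusion probabilities and namesx the variable names):

```r
# Posterior inclusion probabilities live in the fitted bas object:
# model$probne0 holds P(beta != 0 | Y), model$namesx the names
pip <- setNames(model$probne0, model$namesx)
# The median probability model keeps every predictor with PIP >= 0.5
names(pip[pip >= 0.5])
```

This should list the same predictors returned by variable.names(MPM) above.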

Note that we can also extract the best model from the attributes of the fitted values.

Best Predictive Model

In general, the HPM or MPM are not the best predictive models, which from a Bayesian decision theory perspective would be the model that is closest to BMA predictions under squared error loss.

BPM <- predict(model, estimator = "BPM")
variable.names(BPM)
##  [1] "Intercept" "URXUCR"    "URXCNP"    "URXCOP"    "URXECP"    "URXHIBP"  
##  [7] "URXMBP"    "URXMEP"    "URXMHBP"   "URXMHH"

Now we use ggpairs to see how these models compare:

library(GGally)
# All four estimators were obtained with predict, so we extract the
# fitted values from each prediction object with $fit
GGally::ggpairs(data.frame(
  HPM = as.vector(HPM$fit),
  MPM = as.vector(MPM$fit),
  BPM = as.vector(BPM$fit),
  BMA = as.vector(BMA$fit)
))

Using the se.fit = TRUE option with predict, we can also calculate standard deviations for the mean or for predictions, and pass the result to the confint function to obtain credible intervals for the prediction object.

BPM <- predict(model, estimator = "BPM", se.fit = TRUE)
conf.fit <- confint(BPM, parm = "mean")
conf.pred <- confint(BPM, parm = "pred")
plot(conf.fit)

plot(conf.pred)

Prior Selection

BAS uses a model formula, as in lm, to specify the full model with all of the potential predictors. Here we are using the shorthand . to indicate that all remaining variables in the data frame provided by the data argument should be included. Different prior distributions on the regression coefficients may be specified using the prior argument, and include:

  • “BIC”
  • “AIC”
  • “g-prior”
  • “hyper-g”
  • “hyper-g-laplace”
  • “hyper-g-n”
  • “JZS”
  • “ZS-null”
  • “ZS-full”
  • “EB-local”
  • “EB-global”
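As a sketch of how these options are used (assuming the log-transformed nhanes_URX data and the ZS-null model fit from above), we can refit under a different prior, say the BIC-based prior, and compare the posterior inclusion probabilities across the two fits:

```r
# Sketch: refit the same model under the BIC prior and compare
# posterior inclusion probabilities with the Zellner-Siow fit above
model_bic <- bas.lm(BMXBMI ~ ., data = nhanes_URX,
                    prior = "BIC", modelprior = uniform())
round(cbind(`ZS-null` = model$probne0, BIC = model_bic$probne0), 3)
```

Inclusion probabilities that are stable across priors give extra confidence in which variables matter.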

Why Bayesian Model Averaging? Advantages

Model Uncertainty

  • It is important to take account of model uncertainty about statistical structure when making inferences. Oftentimes, there is remaining uncertainty not only about parameters, but also about the underlying true model. In this case, a Bayesian analysis allows one to take into account not only uncertainty about the parameters given a particular model, but also uncertainty across all models combined.

Simultaneous Scenarios

  • BMA allows users to incorporate several competing models into the estimation process. In theory, BMA provides better average predictive performance than any single model that could be selected. It avoids the all-or-nothing mentality associated with classical hypothesis testing, in which a model is either accepted or rejected wholesale. Instead, BMA retains all model uncertainty until the final inference stage, which may or may not feature a discrete decision.

Model Misspecification

  • BMA is relatively robust to model misspecification. If one does select a single model, then one had better be sure of being correct. With BMA, a range of rival models contribute to estimates and predictions, and chances are that one of the models in the set is at least approximately correct.